{2025.06}[2024a] PyTorch 2.9.1 #1389
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4
New job on instance
New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4
New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4
New job on instance
The errors are quite similar to the ones observed in #1314, many of these:
An updated hooks file with a fix for PyTorch has been ingested (EESSI/software-layer-scripts#172). Let's try again.
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-aws-eu-south for:arch=x86_64/amd/zen5
New job on instance
New job on instance
New job on instance
The neoverse v1 build ran out of memory:
I've modified
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1
New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace
New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/generic
New job on instance
New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1
New job on instance
No more memory issues for the neoverse v1 build, but too many failing tests:
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1
New job on instance
@bedroge Looks like we have a winner |
Awesome, thanks a lot @Flamefire! |
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
New job on instance
New job on instance
The a64fx build also ran out of memory, trying again with an updated hooks file...
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
New job on instance
@bedroge I suspect part of the memory problem is related to easybuilders/easybuild-easyblocks#4096, and we also almost certainly want easybuilders/easybuild-easyconfigs#21309 for (some?) ARM CPUs.
The build without ACL (#1389 (comment)) failed because of: |
I had ignored the last one because there is some weird timing issue: it basically starts n processes serially, each doing a sleep, and asserts that the elapsed time is at least n*sleeptime, which fails with
I can take a look at the log again or just increase the allowed failures to 20.
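For readers following along, the pattern described in that comment can be sketched roughly as below. This is not the actual PyTorch test: the function name `run_serial_sleepers` and the values of `n` and `sleep_time` are made up for illustration, and the sketch uses a monotonic clock, so it would not necessarily reproduce the failure seen on a64fx.

```python
import subprocess
import sys
import time


def run_serial_sleepers(n, sleep_time):
    """Start n child processes one after another, each sleeping for
    sleep_time seconds, and return the total elapsed wall time."""
    start = time.monotonic()
    for _ in range(n):
        # Each child is waited for before the next one is started (serial).
        subprocess.run(
            [sys.executable, "-c", f"import time; time.sleep({sleep_time})"],
            check=True,
        )
    return time.monotonic() - start


if __name__ == "__main__":
    n, sleep_time = 3, 0.05  # illustrative values only
    elapsed = run_serial_sleepers(n, sleep_time)
    # The kind of assertion described above: total time >= n * sleeptime.
    assert elapsed >= n * sleep_time
```

An assertion like this can become flaky if the start and end timestamps come from different clocks or from a clock with coarse resolution, which is one possible (unconfirmed) explanation for a timing-related failure.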
Generally speaking I would say that this is a very impressive result for the test suite on A64FX... We should probably take a closer look at the |
Confirmed that easybuilders/easybuild-easyblocks#4096 is enough: on a64fx with ACL as a dependency, training on CIFAR-100 (the benchmark where we originally found that ACL made a big difference) is ~4.75x faster than without the ACL dependency.
Then we should add it as an architecture-specific dependency.
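As a rough illustration of what "architecture-specific dependency" could mean in practice, here is a minimal sketch of the selection logic. Everything in it is hypothetical: `deps_for_target`, `BASE_DEPS`, `EXTRA_DEPS`, and the `x.y` versions are placeholders, and this is neither EasyBuild's nor the EESSI hooks' actual API.

```python
# Hypothetical helper: names and structure are illustrative only,
# not part of EasyBuild or the EESSI hooks.
def deps_for_target(base_deps, cpu_target, extra_deps_by_target):
    """Return base_deps plus any extra dependencies registered for cpu_target."""
    extra = extra_deps_by_target.get(cpu_target, [])
    # Avoid adding a dependency twice if it is already listed.
    return list(base_deps) + [dep for dep in extra if dep not in base_deps]


# Placeholder names and versions, for illustration only.
BASE_DEPS = [("SomeDependency", "x.y")]
EXTRA_DEPS = {"aarch64/a64fx": [("ACL", "x.y")]}

if __name__ == "__main__":
    print(deps_for_target(BASE_DEPS, "aarch64/a64fx", EXTRA_DEPS))
    print(deps_for_target(BASE_DEPS, "x86_64/amd/zen4", EXTRA_DEPS))
```

The idea is simply that ACL is only pulled in for the CPU targets where it was shown to help, while all other targets keep the unmodified dependency list.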
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
New job on instance
New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
New job on instance
New job on instance